Efficient algorithms for exact hierarchical clustering of huge datasets: Tackling the entire protein space

Authors

  • Yaniv Loewenstein
  • Elon Portugaly
  • Menachem Fromer
Abstract

Motivation: UPGMA (average-linkage clustering) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. UPGMA, however, is a complete-linkage method in the sense that all edges between data points must be held in memory. Because of this prohibitive memory requirement, UPGMA does not scale to very large datasets.

Results: We present novel memory-constrained UPGMA (MCUPGMA) algorithms. Given a constrained memory size, our algorithm guarantees the exact same UPGMA clustering solution without explicitly holding all edges in memory. Our algorithms are general and applicable to any dataset. We present a theoretical characterization of the algorithm's efficiency and of its hardness for various data. We show the performance of our algorithm under restricted memory constraints. The presented concepts are applicable to any agglomerative clustering formulation. We apply our algorithm to the entire collection of protein sequences to automatically build a novel evolutionary tree of all proteins using no prior knowledge. We show that the newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4, or single-linkage clustering. The robustness of UPGMA improves significantly on existing methods, especially for multi-domain proteins and for large or divergent families. Our algorithm is scalable to any feasible increase in sequence database sizes.

Availability: The evolutionary tree of all proteins in the entire UniProt set, together with navigation and classification tools, will be made available as part of the ProtoNet service. A C++ implementation of the algorithm, suitable for any type or size of data, is available.

Contact: [email protected]
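For orientation, the standard UPGMA procedure that the abstract takes as its starting point can be sketched as follows. This is a minimal textbook version that keeps every pairwise edge in memory, which is exactly the requirement the abstract describes as prohibitive; it is not the memory-constrained MCUPGMA algorithm. The function name `upgma`, the `frozenset`-keyed edge dictionary, and the tie-breaking rule are illustrative assumptions, not taken from the authors' C++ implementation.

```python
def upgma(dist, n):
    """Textbook UPGMA over n points labelled 0..n-1 (illustrative sketch).

    dist maps frozenset({i, j}) to the dissimilarity of every pair i != j.
    Returns a list of merges: (cluster_a, cluster_b, height, new_cluster_id).
    """
    d = dict(dist)                     # working copy of ALL pairwise edges
    size = {i: 1 for i in range(n)}    # active cluster id -> number of points
    merges = []
    next_id = n
    while len(size) > 1:
        # Closest pair of active clusters; ties broken by cluster ids (assumption).
        pair = min(d, key=lambda p: (d[p], sorted(p)))
        a, b = sorted(pair)
        height = d.pop(pair)
        # Average linkage: distance to the merged cluster is the
        # size-weighted mean of the distances to its two parts.
        for c in size:
            if c in (a, b):
                continue
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            d[frozenset((next_id, c))] = (
                size[a] * dac + size[b] * dbc
            ) / (size[a] + size[b])
        size[next_id] = size.pop(a) + size.pop(b)
        merges.append((a, b, height, next_id))
        next_id += 1
    return merges


# Toy input: four points, six edges (all pairs must be present).
edges = {
    frozenset((0, 1)): 2.0,
    frozenset((2, 3)): 4.0,
    frozenset((0, 2)): 6.0,
    frozenset((1, 2)): 6.0,
    frozenset((0, 3)): 8.0,
    frozenset((1, 3)): 8.0,
}
print(upgma(edges, 4))
# [(0, 1, 2.0, 4), (2, 3, 4.0, 5), (4, 5, 7.0, 6)]
```

The edge dictionary holds n(n-1)/2 entries at the start, so memory grows quadratically with the number of points. That quadratic growth is the bottleneck the memory-constrained variant is designed to avoid.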

Similar resources

Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

MOTIVATION UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. APPLICATION We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any pr...

An improved opposition-based Crow Search Algorithm for Data Clustering

Data clustering is an ideal way of working with a huge amount of data and looking for a structure in the dataset. In other words, clustering is the classification of the same data; the similarity among the data in a cluster is maximum and the similarity among the data in the different clusters is minimal. The innovation of this paper is a clustering method based on the Crow Search Algorithm (CS...

Assessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories

In recent years, the tremendous and increasing growth of spatial trajectory data and the necessity of processing and extraction of useful information and meaningful patterns have led to the fact that many researchers have been attracted to the field of spatio-temporal trajectory clustering. The process and analysis of these trajectories have resulted in the extraction of useful information whic...

Graph Clustering by Hierarchical Singular Value Decomposition with Selectable Range for Number of Clusters Members

Graphs have many applications in real-world problems. When dealing with a huge volume of data, analysis is difficult or sometimes impossible. In big data problems, clustering is a useful tool for data analysis. Singular value decomposition (SVD) is one of the best algorithms for clustering graphs, but we do not have any choice to select the number of clusters and the number of members ...

A Clustering Based Location-allocation Problem Considering Transportation Costs and Statistical Properties (RESEARCH NOTE)

Cluster analysis is a useful technique in multivariate statistical analysis. Different types of hierarchical cluster analysis and K-means have been used for data analysis in previous studies. However, the K-means algorithm can be improved using metaheuristic algorithms. In this study, we propose a simulated annealing based algorithm for K-means in the clustering analysis which we refer it a...

Publication date: 2008